Table of Contents



Only IBD Data

Libraries


Data


Pull in Data

ENDSC Data

Data Definitions

Data Manipulation

IBD Stages: Since the stages of UC are determined by the severity of symptoms, the classes are added manually based on symptoms.

Use decision trees to determine

Perhaps find a doctor who can provide some expertise on the stages? Check whether this is possible (we would need multiple people to reach statistical significance).

Data Transformation: Since the data is already dummy coded, it will need to be transformed in order to interpret the outcome after modeling.

Set Seed for consistency

Crypt architecture measures the severity of the deformation of the colon, which also indicates the severity stage of each case. This is the column that will be used to determine case severity.

Convert the data to object rather than int, since these are categorical data.
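The dtype conversion described above could look roughly like the following. This is a minimal sketch on a toy frame: the column names are borrowed from symptoms mentioned in these notes, and the values are made up.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
df = pd.DataFrame({
    "Crypt architecture": [0, 1, 2, 3],
    "Mucin depletion": [0, 1, 1, 2],
})

# Treat the integer codes as categorical labels rather than quantities.
categorical_cols = ["Crypt architecture", "Mucin depletion"]
df = df.astype({col: object for col in categorical_cols})

print(df.dtypes)
```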

Data Cleaning

Clean Diagnosis: Strip whitespace, convert to upper case, and ensure all spellings are correct to prevent unnecessary splitting of classes.
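A small sketch of this cleaning step, assuming the diagnosis is stored as a string column. The raw values and the spelling map below are hypothetical; the real misspellings would need to be read off the data.

```python
import pandas as pd

# Hypothetical raw diagnosis values with stray whitespace and mixed case.
diagnosis = pd.Series([" uc", "UC ", "Crohns", "crohn's", "Normal"], name="Diagnosis")

# Strip whitespace and upper-case so identical classes collapse together.
cleaned = diagnosis.str.strip().str.upper()

# Hypothetical spelling map; extend it with whatever variants the data contains.
spelling_fixes = {"CROHNS": "CROHN'S"}
cleaned = cleaned.replace(spelling_fixes)

print(sorted(cleaned.unique()))
```

Checking the unique values after cleaning is a quick way to confirm that no class is still split by a spelling variant.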

Missing/Duplicate Data Checks

There is no duplicate data.

There are no missing data values
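Both checks can be done in a couple of lines with pandas. The frame below is a made-up stand-in for the real data.

```python
import pandas as pd

# Made-up stand-in for the real dataset.
df = pd.DataFrame({
    "Age": [34, 51, 34],
    "Diagnosis": ["UC", "NORMAL", "CROHN'S"],
})

# Count fully duplicated rows and missing cells across the whole frame.
n_duplicates = int(df.duplicated().sum())
n_missing = int(df.isna().sum().sum())

print(f"duplicates: {n_duplicates}, missing values: {n_missing}")
```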

Train Test Split

Cross and coworkers randomly shuffled the dataset and used the first 540 cases as the train set and the last 269 cases as the test set.
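A sketch of a shuffle-then-slice split with the same 540/269 sizes. The seed value and the `case_id` column are assumptions for illustration, so this will not reproduce the original authors' exact shuffle.

```python
import numpy as np
import pandas as pd

SEED = 42      # assumed seed; the notes do not state the value used
N_TRAIN = 540  # train size stated above; the remaining 269 cases form the test set

# Placeholder frame with one row per case (540 + 269 = 809 cases).
df = pd.DataFrame({"case_id": np.arange(809)})

# Shuffle all rows, then slice off the first 540 as train and the rest as test.
shuffled = df.sample(frac=1, random_state=SEED).reset_index(drop=True)
train = shuffled.iloc[:N_TRAIN]
test = shuffled.iloc[N_TRAIN:]

print(len(train), len(test))
```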

Class Imbalance

The minority class of healthy cases was oversampled so that there were equal numbers of diseased and healthy cases. This is also reflected in the graphs below.
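The notes do not say which oversampling technique was used at this point. The sketch below assumes simple random oversampling with replacement on a toy train set; the column names and counts are made up.

```python
import pandas as pd

# Toy train set with an imbalanced label (made-up counts).
train = pd.DataFrame({
    "label": ["diseased"] * 6 + ["healthy"] * 2,
    "feature": range(8),
})

counts = train["label"].value_counts()
majority_size = int(counts.max())
minority_label = counts.idxmin()

# Resample minority rows with replacement until both classes are the same size.
minority = train[train["label"] == minority_label]
extra = minority.sample(n=majority_size - len(minority), replace=True, random_state=0)
balanced = pd.concat([train, extra], ignore_index=True)

print(balanced["label"].value_counts())
```

Oversampling should only be applied to the train set, so that the test set keeps the original class balance.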

EDA

Jamie


The age distribution is skewed towards younger patients, and there are outliers under 15 and above 85. Since there is no evidence that these ages are errors as opposed to simply having a low count, they will be left in the data.

The data below shows that the majority of the cases are from years 90-92 and 95-96. The other years, in particular those prior to year 90, contribute few cases.

Although the data is a mixture of histology and endoscopy, the majority of the confirmation methods are endoscopy.

Distribution of the dataset: the majority of the cases are UC, and the remainder are split roughly evenly between normal and Crohn's.

Correlations

The correlation matrix is shown below. It does not use the same method that is used for continuous variables, but rather one suited to categorical variables.

We see strong correlations between the symptoms. Specifically, there is a strong correlation between active inflammation and lamina propria polymorphs, which is investigated further below.

Many of the correlations are intuitively connected. For example, cryptitis polymorphs and cryptitis extent are correlated, since both relate to where there is inflammation in the linings of the stomach and to the morphed cells of the glands.

One interesting observation is the correlation between epithelial changes and mucin depletion, since the epithelial layer concerns the outer layer of the intestine while mucin depletion primarily concerns the inner side of the organ.

Active inflammation and lamina propria polymorphs

Overall, the relationship with active inflammation makes sense: if there is no inflammation, there would in turn be no polymorphs. Since the inner linings typically only show polymorphs when there is inflammation, this result is intuitive.

Odds Ratio

Odds ratio is a measure of association between an exposure and an outcome. The OR represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure. source
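As a worked illustration of the definition above, the odds ratio can be computed from the four cells of a 2x2 exposure/outcome table. The counts here are made up and are not taken from the dataset.

```python
def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Odds of the outcome among the exposed divided by the odds among the unexposed."""
    odds_exposed = exposed_cases / exposed_noncases
    odds_unexposed = unexposed_cases / unexposed_noncases
    return odds_exposed / odds_unexposed


# Made-up counts: 20 of 100 exposed and 10 of 100 unexposed patients have the outcome.
example = odds_ratio(exposed_cases=20, exposed_noncases=80,
                     unexposed_cases=10, unexposed_noncases=90)
print(round(example, 2))
```

An odds ratio close to 1 means the exposure makes little difference to the odds of the outcome.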

There is no strong association between the two: for a patient of a given year and age, the odds of being diagnosed with UC or Crohn's are 1:1.

Reducing categorical classes: Since there isn't a high number of classes in any categorical column, there is no need to reduce the number of classes.

Walter


Each column in the dataset is a symptom. Some of the symptoms are rankings. When the column for Subepithelial collagen is 1, it means that the patient had that symptom, and when it is 0, it means the patient did not have that symptom.

Supervised Learning

Get only the binary variables

Calculate the relative risk of having IBD if the patient has or doesn't have Patchy lamina propria cellularity

What proportion of those with patchy lamina propria had Crohn's Disease?

How much greater is the chance of having Crohn's disease if you have patchy lamina propria cellularity vs. if you don't have patchy lamina propria cellularity?

Observe above that the probability of having Crohn's is twice as high if you have patchy lamina propria cellularity vs. if you don't.
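The comparison above is a relative risk: the proportion with Crohn's among patients with the symptom, divided by the proportion among patients without it. A minimal sketch on toy data follows; the column names `patchy_lpc` and `crohns` are hypothetical, and the counts are chosen so that the ratio comes out to 2.

```python
import pandas as pd

# Toy data: 10 patients with the symptom (4 with Crohn's), 10 without (2 with Crohn's).
df = pd.DataFrame({
    "patchy_lpc": [1] * 10 + [0] * 10,
    "crohns": [1] * 4 + [0] * 6 + [1] * 2 + [0] * 8,
})

# Risk of Crohn's within each symptom group (mean of a 0/1 column is a proportion).
risk = df.groupby("patchy_lpc")["crohns"].mean()

# Relative risk: risk given the symptom divided by risk given no symptom.
relative_risk = risk.loc[1] / risk.loc[0]
print(relative_risk)
```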

Determine relative risk of Crohn's or UC for all the symptoms
Calculation will require creating 4 tables:

  1. Symptom, Is Symptom Present, Confirmed Diagnosis, Count
  2. Symptom, Is Symptom Present, Count
  3. Symptom, Is Symptom Present, Confirmed Diagnosis, Proportion
  4. Symptom, Confirmed Diagnosis, Relative Risk (Final Table)
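The four tables listed above can be built with a melt followed by group-bys. This is a sketch on made-up data with two placeholder symptoms, assuming binary symptom columns and a `Diagnosis` column; it is not necessarily how the notebook computes them.

```python
import pandas as pd

# Made-up wide data: one row per patient, one 0/1 column per symptom.
df = pd.DataFrame({
    "Diagnosis": ["CROHN'S", "UC", "UC", "CROHN'S", "UC", "NORMAL", "UC", "CROHN'S"],
    "Symptom A": [1, 1, 1, 1, 0, 0, 0, 0],
    "Symptom B": [1, 0, 1, 0, 1, 0, 1, 0],
})
long = df.melt(id_vars="Diagnosis", var_name="Symptom", value_name="Present")

# 1. Symptom, Is Symptom Present, Confirmed Diagnosis, Count
counts = (long.groupby(["Symptom", "Present", "Diagnosis"])
              .size().rename("Count").reset_index())

# 2. Symptom, Is Symptom Present, Count
totals = long.groupby(["Symptom", "Present"]).size().rename("Total").reset_index()

# 3. Symptom, Is Symptom Present, Confirmed Diagnosis, Proportion
props = counts.merge(totals, on=["Symptom", "Present"])
props["Proportion"] = props["Count"] / props["Total"]

# 4. Symptom, Confirmed Diagnosis, Relative Risk (present / absent)
wide = props.pivot_table(index=["Symptom", "Diagnosis"], columns="Present",
                         values="Proportion", fill_value=0)
relative_risk = (wide[1] / wide[0]).rename("Relative Risk").reset_index()
print(relative_risk)
```

A diagnosis that never occurs without the symptom would give a zero denominator in the last step, so infinite values need handling on real data.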

Out of all the people who had increased lamina propria cellularity, what percent had Crohn's disease?
The table below shows that 22.5% of patients with increased lamina propria cellularity had Crohn's disease.

Out of all the people who did NOT have increased lamina propria cellularity, how many had Crohn's disease?

You have two people, one with increased lamina propria cellularity and the other one without increased lamina propria cellularity. How much more likely is the first person to have Crohn's disease compared to the second?

For unsupervised EDA, the objective is to find groups of symptoms that are all 1 for the same patients and all 0 for other patients.

  1. First, manually calculate the risk ratio between Symptom A and Symptom B
  2. Next, create a cross tab where the row is Symptom A, the column is Symptom B and the cell value is the risk ratio of Symptom B / Symptom A
  3. Finally, find the groups of symptoms that have the highest risk ratios for one another. If 3 columns have high relative risk ratios, consider keeping only one of those columns and dropping the other 2

Risk Ratio

What is the risk of getting "Increased lamina propria cellularity" if you do have "Lamina propria granulomas" versus the risk of getting "Increased lamina propria cellularity" if you do not have "Lamina propria granulomas"?

If two symptoms are both positive in 1000 patients and both negative in another 1000 patients, this would indicate a correlation between those 2 symptoms.

Get the cross tab of every symptom with every other symptom

In the cross tab below, the value in the second row and the fourth column (Increased Lamina propria cellularity_1) is 0.894737. This means that 89% of the patients (in the train set) who had basal histiocytic cells also had increased lamina propria cellularity. Notice how this 89% adds up to 100% with the 10.5263% to its left. That 10% is the proportion of patients who had basal histiocytic cells but did NOT have increased lamina propria cellularity.
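A row-normalized cross tab of this kind, where each row sums to 1, can be produced with `pd.crosstab(..., normalize="index")`. The sketch below uses two placeholder symptom columns with made-up values.

```python
import pandas as pd

# Made-up binary symptom columns.
df = pd.DataFrame({
    "Symptom A": [1, 1, 1, 1, 0, 0],
    "Symptom B": [1, 1, 1, 0, 0, 1],
})

# normalize="index" divides each row by its total, so each cell is the
# proportion of patients in that Symptom A group with that Symptom B value.
table = pd.crosstab(df["Symptom A"], df["Symptom B"], normalize="index")
print(table)
```

Here the cell at row 1, column 1 is the share of patients with Symptom A who also have Symptom B, and the cell to its left is the share who do not.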

Which feature is most correlated with the other features?

Observe that "Increased lamina propria cellularity" and "Active Inflammation" are the columns most correlated with the other symptoms.

noExposureDf: Get all the risks of having Symptom B given that you don't have Symptom A.
exposureDf: Get all the risks of having Symptom B given that you do have Symptom A.

Divide risk-given-exposure by risk-given-no-exposure to get the relative risk for every symptom pair.
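One way to build the two risk matrices and divide them is sketched below on toy data. The names `exposure_df` and `no_exposure_df` mirror exposureDf and noExposureDf from the notes; the symptom columns A, B and C are placeholders.

```python
import pandas as pd

# Toy binary symptom matrix (placeholder symptoms A, B, C).
symptoms = pd.DataFrame({
    "A": [1, 1, 1, 1, 0, 0, 0, 0],
    "B": [1, 1, 1, 0, 1, 0, 0, 0],
    "C": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Rows are the exposure symptom, columns are the outcome symptom.
# Each cell is the proportion with the outcome within the exposure group.
exposure_df = pd.DataFrame(
    {a: symptoms[symptoms[a] == 1].mean() for a in symptoms.columns}
).T
no_exposure_df = pd.DataFrame(
    {a: symptoms[symptoms[a] == 0].mean() for a in symptoms.columns}
).T

# Element-wise division gives the relative risk for every symptom pair.
relative_risk = exposure_df / no_exposure_df
print(relative_risk)
```

The diagonal divides by zero, because a symptom never occurs among patients who lack it, so those cells come out infinite.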

The relative risk from our risk matrix is the same as the one we calculated manually: 1.559171.

Replace infinity values or abnormally high Relative risks with 0
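A sketch of this replacement step on a made-up relative risk matrix. The cutoff for "abnormally high" is not stated in the notes, so `MAX_PLAUSIBLE_RR` is a hypothetical threshold.

```python
import numpy as np
import pandas as pd

# Made-up relative risk matrix containing infinite and very large values.
relative_risk = pd.DataFrame(
    {"A": [np.inf, 3.0], "B": [250.0, np.inf]},
    index=["A", "B"],
)

MAX_PLAUSIBLE_RR = 100  # hypothetical cutoff; choose one that suits the data

# Replace infinities first, then zero out anything above the cutoff.
cleaned = relative_risk.replace([np.inf, -np.inf], 0)
cleaned = cleaned.mask(cleaned > MAX_PLAUSIBLE_RR, 0)
print(cleaned)
```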

Observe in the heatmap below that submucosal granulomas are highly correlated with lamina propria granulomas.

Out of the 453 patients who did not have patchy lamina propria cellularity, none also had lamina propria granulomas.
However, out of the 87 patients who had patchy lamina propria cellularity, 4 also had lamina propria granulomas.
It looks like these 2 columns are correlated.

Out of the 453 patients who did not have patchy lamina propria cellularity, only 1 also had lamina propria granulomas.
However, out of the 87 patients who had patchy lamina propria cellularity, 12 also had lamina propria granulomas.
It looks like these 2 columns are correlated.

You have two people, one with increased lamina propria cellularity and the other one without increased lamina propria cellularity. How much more likely is the first person to have Crohn's disease compared to the second?

Model Assumptions

The model assumptions are not a concern for visualizing the data. The main requirements are that the features are not linearly correlated with one another and that the observations are independent, which is assumed because each data point is unique.

Because the data is primarily categorical, the required modifications and assumptions are difficult to assess.

Data Prep for Modeling

Dummy coding the Data

There are two different encoding methods, dummy coding and ordinal. Before converting to dummy codes, the data is first returned to its original form and then dummy coded, to understand the effects of feature reduction and whether it's required.

For the column "Initial pathologists diagnosis_?IBD ?Infective", there is only one instance of this observation. Because of this, we will drop the column, as it would cause errors during analysis.
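A sketch of dummy coding followed by dropping any dummy column that has at most one positive instance. The category values other than "?IBD ?Infective" are made up for illustration.

```python
import pandas as pd

# Made-up values for the categorical column; only "?IBD ?Infective" occurs once.
df = pd.DataFrame({
    "Initial pathologists diagnosis": [
        "UC", "UC", "Crohn's", "Crohn's", "Normal", "Normal", "?IBD ?Infective",
    ],
})

# One 0/1 column per category, named "<column>_<category>".
dummies = pd.get_dummies(df, columns=["Initial pathologists diagnosis"])

# Drop dummy columns with at most one positive instance.
rare_cols = [col for col in dummies.columns if dummies[col].sum() <= 1]
dummies = dummies.drop(columns=rare_cols)

print(rare_cols)
print(list(dummies.columns))
```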

Ordinal Data

Ordinal coding is the form in which the data is already set up. This allows the researchers to put the remaining data types into an ordinal setup for analysis.

Of the two methods, one will be selected for analysis.

Train/Test Split

Max-Min Transformation
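A minimal sketch of min-max scaling, assuming the minimum and range are learned from the train set and then applied to the test set. The `Age` values are made up.

```python
import pandas as pd

# Made-up numeric feature in train and test sets.
train = pd.DataFrame({"Age": [20.0, 40.0, 60.0]})
test = pd.DataFrame({"Age": [30.0, 70.0]})

# Learn the scaling parameters from the train set only.
col_min = train.min()
col_range = train.max() - col_min

# Apply the same parameters to both sets: (x - min) / (max - min).
train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range

print(train_scaled["Age"].tolist(), test_scaled["Age"].tolist())
```

Test values can fall outside [0, 1] when they lie outside the train range, as the 70-year-old case does here.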

SMOTE

Feature Importance

Chi-squared is used for determining feature importance. source
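The notes do not say which chi-squared implementation was used. As an illustration only, the sketch below computes Pearson's chi-squared statistic from the contingency table of each feature against the target and ranks features by it. The helper `chi_squared` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd


def chi_squared(feature, target):
    """Pearson chi-squared statistic for the feature-by-target contingency table."""
    observed = pd.crosstab(feature, target).to_numpy()
    row_totals = observed.sum(axis=1)
    col_totals = observed.sum(axis=0)
    # Expected counts under independence of feature and target.
    expected = np.outer(row_totals, col_totals) / observed.sum()
    return float(((observed - expected) ** 2 / expected).sum())


# Toy data: one feature identical to the target, one unrelated to it.
df = pd.DataFrame({
    "target":      [1, 1, 1, 1, 0, 0, 0, 0],
    "informative": [1, 1, 1, 1, 0, 0, 0, 0],
    "noise":       [1, 0, 1, 0, 1, 0, 1, 0],
})

scores = {col: chi_squared(df[col], df["target"]) for col in ["informative", "noise"]}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

A larger statistic means the feature's distribution differs more between the target classes than independence would predict.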

ETL Pipeline

Confirmed not to be needed given the time constraint.

Machine Learning Models


Parameter Tuning

Creation of novel ML Models

Running Models

Results

Statistics

Visualizations

Discussion

Conclusions